Skip to content

BUG-26988 implement replace for categorical blocks #27026

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 16, 2019

Conversation

JustinZhengBC
Copy link
Contributor

The replace functions used in the code path that caused the bug report appeared to densify the series. This PR adds a replace function to CategoricalBlock that takes advantage of the categorical data type when possible

@codecov
Copy link

codecov bot commented Jun 25, 2019

Codecov Report

Merging #27026 into master will decrease coverage by 0.01%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #27026      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.02%     
==========================================
  Files         180      180              
  Lines       50774    50790      +16     
==========================================
+ Hits        46712    46719       +7     
- Misses       4062     4071       +9
Flag Coverage Δ
#multiple 90.62% <68.75%> (-0.01%) ⬇️
#single 41.79% <56.25%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 94.17% <75%> (-0.22%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.89% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.84% <0%> (-0.11%) ⬇️
pandas/io/formats/format.py 97.91% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83fe8d7...a2f2a7c. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 25, 2019

Codecov Report

Merging #27026 into master will decrease coverage by 1.21%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #27026      +/-   ##
==========================================
- Coverage   93.07%   91.86%   -1.22%     
==========================================
  Files         192      180      -12     
  Lines       49562    50833    +1271     
==========================================
+ Hits        46130    46697     +567     
- Misses       3432     4136     +704
Flag Coverage Δ
#multiple 90.5% <93.33%> (-1.33%) ⬇️
#single 41.84% <46.66%> (-0.67%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 95.03% <100%> (-1.54%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/computation/expressions.py 93.27% <0%> (-5.81%) ⬇️
pandas/compat/pickle_compat.py 63.88% <0%> (-4.72%) ⬇️
pandas/plotting/_matplotlib/timeseries.py 66.32% <0%> (-3.79%) ⬇️
pandas/core/computation/scope.py 93.13% <0%> (-3.74%) ⬇️
pandas/compat/__init__.py 92% <0%> (-2.88%) ⬇️
pandas/io/parquet.py 65.59% <0%> (-2.55%) ⬇️
pandas/io/formats/printing.py 86.72% <0%> (-2.49%) ⬇️
pandas/util/_decorators.py 92.13% <0%> (-2.44%) ⬇️
... and 182 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 04893a9...8667b0d. Read the comment docs.

@@ -585,6 +585,7 @@ Categorical

- Bug in :func:`DataFrame.at` and :func:`Series.at` that would raise exception if the index was a :class:`CategoricalIndex` (:issue:`20629`)
- Fixed bug in comparison of ordered :class:`Categorical` that contained missing values with a scalar which sometimes incorrectly resulted in ``True`` (:issue:`26504`)
- Bug in :func:`DataFrame.replace` and :func:`Series.replace` that would give unusual results on categorical data (:issue:`26988`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"unusual" --> "incorrect"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -2978,6 +2978,29 @@ def where(self, other, cond, align=True, errors='raise',
axis=axis, transpose=transpose)
return result

def replace(self, to_replace, value, inplace=False, filter=None,
regex=False, convert=True):
if filter is None or not self.mgr_locs.isin(filter).any():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this block looks unrelated to the linked issue. what case is it solving? needs a separate test?

Copy link
Contributor Author

@JustinZhengBC JustinZhengBC Jun 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first case is used when the replace function applies to the entire block, so the replace can be carried out just by manipulating categories. Most calls to Series.replace should be covered with this case (including the new test in series/test_replace.py). If there's a filter, this might not be possible so we fix the categories then use the original implementation (second case)

I added a test in frame/test_replace.py to cover the second case, where a filter is used. It improves upon the previous behaviour, which would have casted the frame to object

@JustinZhengBC
Copy link
Contributor Author

Failure unrelated #26842

@gfyoung gfyoung added Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Jun 26, 2019
# GH 26988
df = DataFrame([[1, 1], [2, 2]], columns=['a', 'b'], dtype='category')
expected = DataFrame(final_data, columns=['a', 'b'], dtype='category')
expected['a'].cat.set_categories([1, 2, 3], inplace=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try not to use inplace in tests, its non-idiomatic.

@TomAugspurger TomAugspurger added this to the 0.25.0 milestone Jul 8, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the merge conflict. LGTM.

@jbrockmendel
Copy link
Member

Are there tests for the non-None filter branch?

@JustinZhengBC
Copy link
Contributor Author

@jbrockmendel yes, the dataframe tests take that branch while the series tests do not

@jreback jreback removed this from the 0.25.0 milestone Jul 9, 2019
@jbrockmendel
Copy link
Member

I am -1 on this PR as is; this should be a method on a Categorical.

If it goes on Categorical, it should be generalized to EA.

@jreback
Copy link
Contributor

jreback commented Jul 10, 2019

I am -1 on this PR as is; this should be a method on a Categorical.

If it goes on Categorical, it should be generalized to EA.

probably. But I think its much better then jamming things in the blocks (now we already have replace on the blocks, this PR might be ok actually), but I don't fully remember.

@TomAugspurger
Copy link
Contributor

I don't think we (or downstream EA authors) can reliably implement ExtensionArray.replace since it's such a complicated method.

We're also relying on Block.replace to do the heavy lifting. An ExtensionArray.replace wouldn't have that luxury.

@jreback
Copy link
Contributor

jreback commented Sep 8, 2019

can you merge master

@jbrockmendel
Copy link
Member

will take a look this afternoon

@@ -88,6 +88,7 @@ Categorical
^^^^^^^^^^^

- Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`)
- Bug in :func:`DataFrame.replace` and :func:`Series.replace` that would give incorrect results on categorical data (:issue:`26988`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"func" -> "meth"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

cat.remove_categories(to_replace, inplace=True)
else:
index = categories.index(to_replace)
categories[index] = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if value is already in categories?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed with new case

@jbrockmendel
Copy link
Member

Only major question is whether we need to add replace to the EA interface.

inplace = validate_bool_kwarg(inplace, "inplace")
cat = self if inplace else self.copy()
categories = cat.categories.tolist()
if to_replace in categories:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if to_replace is not hashable? I think at the Categorical level we are OK raising, but at the Block level we would cast to object right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would casting to object help?

>>> import pandas as pd
>>> s=pd.Series([1,2]).astype(object)
>>> s.replace(1,[1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/series.py", line 4303, in replace
    method=method,
  File "pandas/core/generic.py", line 6819, in replace
    raise TypeError(msg)  # pragma: no cover
TypeError: Invalid "to_replace" type: 'int'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I've merged master and fixed all comments except this one for the reason above

@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

this looked pretty good, can you merge master and address any remaining comments.

@jreback
Copy link
Contributor

jreback commented Oct 18, 2019

can you merge master

@jbrockmendel
Copy link
Member

@JustinZhengBC can you merge master. itd be nice to get this in

@JustinZhengBC
Copy link
Contributor Author

@jbrockmendel done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question about an error condition

@@ -2471,6 +2471,51 @@ def isin(self, values):
code_values = code_values[null_mask | (code_values >= 0)]
return algorithms.isin(self.codes, code_values)

def replace(self, to_replace, value, inplace: bool = False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type these more specifically (can be a followon PR)

@JustinZhengBC
Copy link
Contributor Author

@jreback how should parameters like to_replace and value be typed when they can take on any type and might not even be the same type?

@jreback
Copy link
Contributor

jreback commented Nov 4, 2019

@jreback how should parameters like to_replace and value be typed when they can take on any type and might not even be the same type?

right, but they are something like Union[Scalar, AnyArrayLike] I think? (again this is likely tricky so should do as a followup)

@TomAugspurger
Copy link
Contributor

Fixed the merge conflict.

I think we decided to leave typing as a followup, so this should be good to go?

result.values.add_categories(value, inplace=True)
return super(CategoricalBlock, result).replace(
to_replace, value, inplace, filter, regex, convert
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can any of the logic in this method generalize to ExtensionBlock? (presumably we'd need to add replace to the EA interface?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good point, but we can certainly do as a followup; can you open an issue for discussion.

@jreback
Copy link
Contributor

jreback commented Nov 13, 2019

@JustinZhengBC if you can address @jbrockmendel comments to add some comments this lgtm and we can merge. (alternatvely @jbrockmendel could do as a followup)

@JustinZhengBC
Copy link
Contributor Author

@jreback @jbrockmendel I added the comment

@alimcmaster1
Copy link
Member

Seems like this is good to merge.
CC. @jreback @jbrockmendel

@jreback jreback merged commit fb08cee into pandas-dev:master Nov 16, 2019
@jreback
Copy link
Contributor

jreback commented Nov 16, 2019

thanks for the patch @JustinZhengBC very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Series[category].replace casts incorrectly
6 participants